Skip to content

get entire test suite working with offload#962

Open
evgunter wants to merge 59 commits intomainfrom
mng/offload
Open

get entire test suite working with offload#962
evgunter wants to merge 59 commits intomainfrom
mng/offload

Conversation

@evgunter
Copy link
Copy Markdown
Contributor

@evgunter evgunter commented Mar 25, 2026

it did some suspicious thing where it replaced all the mng llmdb calls to "speed them up" so that should definitely be checked

also was there some reason we didn't have a .dockerignore before? i vaguely remember something like this

doesn't actually switch over the local tests to default to offload--i want to make sure we have container retries before doing that

Summary

  • Configure test-results as a shared worktree symlink so offload baselines are shared
  • Update .offload-base-commit to tip of main (36f9e3b6d)
  • Add .dockerignore to prevent Modal upload races from .git/, __pycache__/, etc.
  • Change sandbox_init_cmd to use git apply + git add -A (keeps index in sync without the strictness of --index)
  • Add mng-test-runner non-root user to Dockerfile (Modal ignores USER directive, so it's unused for now)
  • Monkeypatch atomic_write in OSError test since Modal always runs as root
  • Auto-invalidate offload image cache when .offload-base-commit, Dockerfile, or offload-modal.toml changes
  • Call conversation_db.py directly instead of mng llmdb in chat.sh (eliminates ~10s startup per call)
  • Fix tunnel accept loop shutdown race (reduce accept timeout, increase join timeout)
  • Fix leaked processes: add AgentCreator.close() and call it in all creation tests, fix kanpan dispatch test, coordinator tests, forwarding server tests
  • Terminate mng events background processes in all stream manager CG tests
  • Wait for unison to start before stopping in pair test
  • Use file:///nonexistent-repo instead of https://github.com/test/repo in tests (fails instantly, no network hang)

Test plan

  • just test-offload passes with 0 failures -- 20 consecutive clean runs (136,660 test executions, 0 failures)
  • Verify worktree symlink: new agents from ~/mng get test-results/ symlinked
  • Verify cache invalidation: changing .offload-base-commit triggers image rebuild without --no-cache

🤖 Generated with Claude Code

evgunter and others added 12 commits March 24, 2026 16:25
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The work_dir_extra_paths table header was absorbing the
default_destroyed_host_persisted_seconds scalar field.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Update .offload-base-commit to tip of origin/main (36f9e3b) so the
  base image includes xxd and excludes deleted files like benchmark_branches.sh
- Use git apply --index in sandbox_init_cmd so the git index stays in sync
  with the working tree (fixes git ls-files listing deleted files)
- Add mng-test-runner user to Dockerfile for future use running tests as
  non-root (some tests assume filesystem permission checks work correctly)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Active agents modify .git/worktrees/*/FETCH_HEAD during git operations,
which races with Modal's image upload and causes "modified during build
process" errors.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Git state is delivered via a tarball (current.tar.gz), not the live
.git dir. Files under .git/ are actively modified by running agents,
which races with Modal's upload.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Skip test_write_cli_completions_cache_handles_oserror when running as
  root (root bypasses filesystem permission checks, making the test invalid)
- Handle missing files in find_bash_scripts_without_strict_mode (git ls-files
  can list files that were deleted by git apply but not removed from the index)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Hash .offload-base-commit and the Dockerfile to detect when the cached
image is stale. Previously, changing the base commit required manually
passing --no-cache or deleting .offload-image-cache.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
git apply --index requires exact match between working tree and index,
which fails when the Dockerfile's COPY and tarball extraction leave
minor discrepancies. Instead, apply the patch normally then git add -A
to update the index to match the working tree.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The Docker image only needs current.tar.gz from the build context.
Exclude __pycache__, .venv, test artifacts, and other directories that
are modified by concurrent processes during Modal's upload.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Reduce _ACCEPT_TIMEOUT_SECONDS from 1.0s to 0.2s so the tunnel accept
  loop checks the shutdown event more frequently
- Increase test join timeout from 3s to 10s for slow Modal containers
- Add UnisonSyncer.wait_for_started() and use it in
  test_pair_files_syncs_git_state_before_starting so the unison binary
  is actually invoked before the test stops the syncer

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
test_dispatch_immediate_custom_command_does_not_mark was using
"mng connect $MNG_AGENT_NAME" which blocks forever, leaking a
subprocess via the ThreadPoolExecutor. Use "true" instead since
the test only verifies dispatch routing (marks not set), not
command execution.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@evgunter evgunter changed the title Get offload test suite working get entire test suite working with offload Mar 25, 2026
evgunter and others added 17 commits March 25, 2026 12:51
- Terminate mng events --follow processes before CG exit in
  test_stream_manager_mixed_local_and_remote to avoid 4s timeout flake
- Bump ChatTestEnv default timeout from 10s to 30s for slow containers

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add AgentCreator.wait_for_completion() and call it in tests that
  start_creation() with an invalid repo, so the background git clone
  finishes before session cleanup runs
- Add process_manager.terminate_all() to coordinator tests that spawn
  handler subprocesses via process_tasks()

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The mng CLI has heavy startup cost (~10s on Modal containers) due to
Python interpreter + Click + full module graph initialization. chat.sh
was calling mng llmdb for simple SQLite operations that only need
sqlite3 and sys.

Provision conversation_db.py alongside chat.sh and call it directly
with python3. This eliminates the timeout failures in offload.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Problem: _check_local_symlink_state was decorated with @pure, but it
performs filesystem I/O (is_symlink, resolve, exists) and its output
depends on filesystem state rather than only on its arguments.
Fix: Removed the @pure decorator since the function is not pure.
All other @pure functions in host.py are genuinely pure.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
# Conflicts:
#	libs/mng/imbue/mng/hosts/host.py
# Conflicts:
#	libs/mng/imbue/mng/hosts/host.py
Modal's dockerignore may not match nested __pycache__/ dirs with a
bare pattern. Use **/__pycache__/ to be explicit.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use file:///nonexistent-repo instead of https://github.com/test/repo
so git clone fails instantly rather than hanging on network. Also add
wait_for_completion calls to tests that fire-and-forget agent creation.

Also fix .dockerignore to use ** prefix for nested directory matching.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
All 5 tests that use with manager._cg: and call _handle_discovery_line
now terminate background mng events processes before exiting the CG.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add chown to sandbox_init_cmd so mng-test-runner owns /code/mng
- Use runuser to run pytest as mng-test-runner
- Remove skipif(root) from completion_writer_test since tests no
  longer run as root
- Remove exists() band-aid from find_bash_scripts_without_strict_mode
  since git add -A keeps the index in sync

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
sandbox_init_cmd is baked into the image, so changes to it need
to invalidate the cache.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
evgunter and others added 4 commits March 26, 2026 13:30
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Background threads from start_creation() leak git clone processes
and log errors that pollute other tests' captured output. Add close()
to join all threads, and call it in every test that triggers creation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…cked

This test calls start_creation() but never waited for the background
thread to finish, leaking a git clone process.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
# Conflicts:
#	.gitignore
#	libs/mng/imbue/mng/hosts/host.py
@evgunter evgunter marked this pull request as ready for review March 26, 2026 23:34
Comment on lines +195 to +197
creator.wait_for_completion(agent_id, timeout=10.0)

creator.wait_for_completion(agent_id, timeout=10.0)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that looks weirdly duplicated


for process in manager._events_processes.values():
process.terminate()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these seem suss--isn't the whole point of the concurrency group to make sure that this stuff happens automatically?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should i bump the timeout for the concurrency group's garbage collection, then?

_SELECT_TIMEOUT_SECONDS: Final[float] = 1.0

_ACCEPT_TIMEOUT_SECONDS: Final[float] = 1.0
_ACCEPT_TIMEOUT_SECONDS: Final[float] = 0.2
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why lower this? 200ms feels kinda tight for anything

Comment on lines +1552 to +1556
def _rsync_paths(
self,
source_host: OnlineHostInterface,
source_path: Path,
target_path: Path,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this diff must be weird and stacked on another PR?

Comment on lines +294 to +296
_new_conversation_id=$(mng llmdb poll-new "$_LLM_DB" "$_max_rowid")
_new_conversation_id=$(python3 "$_CONV_DB" poll-new "$_LLM_DB" "$_max_rowid")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I struggle to believe that these are right... pretty sure this is just going to break my stuff

Why did we need this?

@joshalbrecht
Copy link
Copy Markdown
Contributor

hmmm, some of this seems pretty suspicious...

evgunter and others added 19 commits March 27, 2026 02:48
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The merge with origin/main reintroduced the old mng-prefixed directories
alongside the new mngr-prefixed ones.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
These files came from origin/main after the rename and still had
the old module path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
.mng/ was renamed to .mngr/ by the migration but the old directory
was re-added during the merge with origin/main.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The ratchet catches any 'mng' not followed by 'r' in file contents.
The old directories are already removed from git and disk.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The constant controls how often the accept loop checks the shutdown
event, not how long clients have to connect. The old name was
misleading.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Our branch had stale merge artifacts: old execute_idempotent_command/
execute_stateful_command split, old _run_shell_command_with_transient_retry
method, and more granular SSH error handling. All of these were already
refactored on main.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
# Conflicts:
#	libs/mngr/imbue/mngr/hosts/host.py
The previous conflict resolution accidentally deleted read_file,
write_file, read_text_file, write_text_file, and _get_file_mtime.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Modal ignores USER directives so this user was never used. The
OSError test is handled via monkeypatch instead.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants